Mining Text for Word Senses Using Independent Component Analysis

نویسنده

  • Reinhard Rapp
چکیده

The assumption that the problem of ambiguity in text analysis can only be solved if statistical dependencies of higher than second order are considered leads us to independent component analysis (ICA), a statistical formalism that takes higher-order dependencies into account. By assuming independence, ICA is capable of detecting a set of hidden vectors if only different linear mixtures of these vectors are observable. As a test case for ICA’s applicability to natural language processing we look at the task of word sense induction. Our starting point is that we consider the co-occurrence vector of an ambiguous word as a linear mixture of its unknown sense vectors. If corpora from different domains are available, this should give us the different linear mixtures that are required for ICA. It turns out that the independent sense vectors derived by ICA from the distributional differences of word usage reflect a word’s meanings surprisingly well.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Mining Word Senses from Text for Corpus-Based Lexicography

This paper discusses the problem of automated lexicography. In the corpus-based approach, a lexicographer has to manually group contexts of a target word into clusters in order to identify word senses. When a large number of the contexts is given, this process becomes a tedious and time-consuming task. To overcome this problem, we propose an efficient technique based on unsupervised clustering....

متن کامل

Unsupervised Text Mining

We describe the results of performing text mining on a challenging problem in natural language processing, word sense disambiguation. We compare two methods of unsupervised learning, Ward's minimum{variance clustering and the EM algorithm, that distinguish the meaning of an ambiguous word based only on features that can be automatically identiied in text. This is a signiicant advantage over mos...

متن کامل

Optimization of Word Sense Disambiguation Using Clustering in Weka

In the Natural Language Processing (NLP) community, Word Sense Disambiguation (WSD) has been described as the task which selects the appropriate meaning (sense) to a given word in a text or discourse where this meaning is distinguishable from other senses potentially attributable to that word. These senses could be seen as the target labels of a classification problem. Clustering and classifica...

متن کامل

KSU KDD: Word Sense Induction by Clustering in Topic Space

We describe our language-independent unsupervised word sense induction system. This system only uses topic features to cluster different word senses in their global context topic space. Using unlabeled data, this system trains a latent Dirichlet allocation (LDA) topic model then uses it to infer the topics distribution of the test instances. By clustering these topics distributions in their top...

متن کامل

A Systematic Study on Document Representation and Dimensionality Reduction for Text Clustering A Systematic Study on Document Representation and Dimensionality Reduction for Text Clustering

Increasingly large text datasets and the high dimensionality associated with natural language is a great challenge of text mining. In this research, a systematic study is conducted of application of three Dimension Reduction Techniques (DRT) on three different document representation methods in the context of the text clustering problem using several standard benchmark datasets. The dimensional...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004